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Abstract. In the context of the genome-wide association studies (GWAS), one has to solve 
long sequences of generalized least-squares problems; such a task has two limiting factors: 
execution time -often in the range of days or weeks- and data management -data sets in 
the order of Terabytes. We present an algorithm that obviates both issues. By pipelining 
the computation, and thanks to a sophisticated transfer strategy, we stream data from hard 
disk to main memory to GPUs and achieve sustained peak performance; with respect to a 
highly-optimized CPU implementation, our algorithm shows a speedup of 2.6x. Moreover, the 
approach lends itself to multiple GPUs and attains almost perfect scalability. When using 4 
GPUs, we observe speedups of 9x over the aforementioned implementation, and 488x over a 
widespread biology library. 

Keywords: GWAS, generalized least-squares, computational biology, out-of-core computa- 
tion, high-performance, multiple GPUs, data transfer, multibuffering, streaming, big data 



1 GWAS, their Importance and Current Implementations 

In a nutshell, the goal of a genome-wide association study (GWAS) is to find an association between 
genetic variants and a specific trait such as a disease Ij. Since there is a tremendous amount of such 
genetic variants, the computation involved in GWAS takes a long time, ranging from days to weeks 
and even months [2 . In this paper, we look at OOC-HP-GWAS, currently the fastest algorithm 
available, and show how it is possible to speed it up by exploiting the computational power offered 
by modern graphics accelerators. 

The solution of GWAS boils down to a sequence of generalized least squares (GLS) problems 
involving huge amounts of data, in the order of Terabytes. The challenge lies in sustaining GPU's 
performance, avoiding idle time due to data transfers from hard disk (HDD) and main memory. 
Our solution, cuGWAS, combines three ideas: the computation is pipelined through GPU and CPU, 
the transfers are executed asynchronously, and the data is streamed from HDD to main memory 
to GPUs by means of a two-level buffering strategy. Combined, these mechanisms allow cuGWAS 
to attain almost perfect scalability with respect to the number of GPUs; when compared to OOC- 
HP-GWAS and another widespread GWAS library, our code is respectively 9 and 488 times faster. 

In the first section of this paper, we introduce the reader to GWAS and the computations 
involved therein. We then give an overview of OOC-HP-GWAS, upon which we build cuGWAS, 
whose key techniques we explain in Section 3 and which we time in Section 4. We provide some 
closing remarks in Section 5. 



1.1 Biological Introduction to GWAS 



The segments of the DNA that contain information about protein synthesis are called genes. They 
encode so-called traits, which are features of physical appearance of the organism -like eye or hair 
color- as well as internal features of the organism -like blood type or resistances to diseases. The 
hereditary information of a species consists of all the genes in the DNA, and is called genome; this 
can be visualized as a book containing instructions for our body. Following this analogy, the letters 
in this book are called nucleotides, and determining their order is referred to as sequencing the 
genome. Even though the genome sequence of every individual is different, within one species most 
of it (99.9% for humans) stays the same. When a single nucleotide of the DNA differs between two 
individuals of the same species, this difference is called a single-nucleotide polymorphism {SNP, 
pronounced "snip" ) and the two variants of the SNP are referred to as its alleles. 

Genome-wide association studies compare the DNA of two groups of individuals. All the indi- 
viduals in the case group have a same trait, for example a specific disease, while all the individuals 
in the control group do not have this trait. The SNPs of the individuals in these groups are com- 
pared; if one variant of a SNP is more frequent in the case group than in the control group, it is 
said that the SNP is associated with the trait (disease). In contrast with other methods for linking 
traits to SNPs, such as inheritance studies or genetic association studies, GWAS consider the whole 
genome [T]. 



1.2 The Importance of GWAS 

We gathered insightful statistics about all published GWAS [3]. Since the first GWAS started to 
appear in 2005 and 2006, the amount of yearly published studies has constantly increased, reaching 
more than 2300 studies in 2011. This trend is summarized in the left panel of Fig. [T] showing the 
median SNP-count of each year's studies along with error-bars for the first and second quartiles. 
One can observe that while GWA studies started out relatively small, since 2009 the amount of 
analyzed SNPs is growing tremendously. Besides the number of SNPs, the other parameter relevant 
to the implementation of an algorithm is the sample size, that is the total number of individuals 
of both the case and the control group. What can be seen in Fig. [TJj is that while it has grown at 
first, in the past four years the median sample size seems to have settled around 10 000 individuals. 
It is apparent that, in contrast to the SNP count, the growth of the sample size is negligible. This 
data, as well as discussions with biologists, confirm the need for algorithms and software that can 
compute a GWAS with even more SNPs, and faster than currently possible. 

1.3 The Mathematics of GWAS 

The GWAS can be expressed as a variance component model 4J whose solution can be formulated 
as 

r, = {xJm-^X,)-^xJm-^v, % = l..m , (1) 

where m is in the millions and all variables on the right-hand side are known. This sequence 
of equations is used to compute in the relations between variations in y (the phenotyp^ and 

^ A phenotype is the observed value of a certain trait of an individual. For example, if the studied trait 
was the hair color, the phenotype of an individual would be the one of "blonde", "brown", "black" or 
"red" . 
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Fig. 1: The median, first and second quartile of a) the SNP-count and b) the sample size of the 
studies each year. 



variations in Xi (the genotype). Each equation is responsible for one SNP, meaning that the number 
TO of equations corresponds to the number of SNPs considered in the study. 

Figure [2] captures the dimensions of the objects involved in one such equation. The height n of the 




Fig. 2: The dimensions of a single instance of (IT 



matrices Xi and M and of the vector y corresponds to the number of samples, thus each row in the 
design-matrix Xi e M"'^^ corresponds to a piece of each individual's genetic makeup (i.e. information 
about one SNP), and each entry in y e M" corresponds to an individual's phenotype]^ A/ e M'"" 
models the relations amongst the individuals, e.g. two individuals being in the same family. Finally, 
an important feature of the matrices Xi is that they can be partitioned as (Xi|X/j. ), where X^ 
contains fixed covariates such as age and sex and thus stays the same for any i, while X^^ is a 
single column vector containing the genotypes of the i-th SNP of all considered individuals. 



^ In the example of the body height as a trait, the entries of y would then be the heights of the individuals. 



Even though ([T]) has to be computed for every single SNP, only the right part of the design- 
matrix Xfl. changes, while X^, M and y stay the same. 



1.4 The Amount of Data and Computation Involved 

We analyze the storage size requirements for the data involved in GWAS. Typical values for p range 



we 



between 4 and 20, but only one entry varies with m. According to our analysis in Section 1.2 
consider n = 10 000 as the size of a study. As of June 2012, the SNP database dhSNP hsts 187 852 828 
known SNPs for humans [5], so we consider m = 190 000000. With these numbers, assuming that all 
data is stored as double precision floating point numbers}^ Therefore, the size of y and M is about 
80 MB and 800 MB, respectively; both fit in main memory and in the GPU memory. The output r 
reaches 30 GB, coming close to the limit of current high-end systems' main memory and is too big 
to fit in a CPU's 6 GB of memory. Weighting in at 14 TB, X is too big to fit into the memory of 
any system in the foreseeable future and has to be streamed from disk. 

In the field of bioinformatics, the ProbABEL [6^ library is frequently used for genome-wide 
association studies. On a Sun Fire X4640 server with an Intel Xeon CPU 5160 (3.00 GHz), the 
authors report a runtime of almost 4 hours for a problem with p= A, n= 1500 and m = 220833, and 
estimate the runtime with m = 2 500000 to be roughly 43 hour^ -almost two days. Compared to 
the current demand, m = 2.5 million is a reasonable amount of SNPs, but a population size of only 
n = 1500 individuals is clearly much smaller than the present median (Fig. [T]). The authors state 
that the runtime grows more than linearly with n and, in fact, tripling up the sample size from 500 
to 1500 increased their runtime by a factor of 14. Coupling this fact with the median sample size 
of about 10 000 individuals, the computation time is bound to reach weeks or even months. 

2 Prior Work: the OOC-HP-GWAS Algorithm 

Presently, the fastest available algorithm for solving ([T]) is OOC-HP-GWAS [4]. Since our work 
builds upon this CPU-only algorithm, we describe its salient features. 

2.1 Algorithmic Features 

OOC-HP-GWAS exploits the the symmetry and the positive definiteness of the matrix M, by 
decomposing it through a Cholesky factorization LL^^ = M. Since M does not depend on i, this 
decomposition can be computed once as a preprocessing step and reused for every instance of ([T]). 
Substituting LL'^ = M into ([l]) and rearranging, we obtain 

n = {{L-^X,f L-^X,)-\L-^X,f L-^y for i = l..m , (2) 

Xi X, X, y 

effectively replacing the inversion and multiplication of M with the solution of a triangular linear 
system (trsv). 



^ Which may or may not be the optimal storage type. More discussion with biologists and analysis of the 
operations is necessary in order to find out whether float is precise enough. If that was the case, the 
sizes should be halved. 

■* We only consider what the authors called the linear model with the — mmscore option as this solves the 
exact problem we tackle. 



The second problem-speeific piece of knowledge that is exploited by OOC-HP-GWAS is the 
structure of X = (XlIXji): Xl stays constant for any i, while Xr varies; plugging Xi = (XlIXh.) 
into ([2]) and moving the constant parts out of the loop leads to an algorithm that takes advantage 
of the structure of the sequence of GLS shown in Listing The acronyms correspond to BLAS 
calls. 

Listing 1.1: Solution of the GWAS-specific sequence of GLS M. 



1 


L 


<- potrf M 


(LL^ = M) 


2 


XI 


<- trsm 


L, XI 


{Xl = L-^Xl) 


3 


Y 


*- trsv 


L, y 


{y = L-^y) 


4 


rt 


*- gemv 


XI, y 


{fr = Xly) 


5 


Stl 


<- syrk 


XI 


{Stl = XlXl) 


6 


for 


i in 1 


.m: 




7 




Xri *- 


trsv L, Xri 


{Xr^=L-'Xr^) 


8 




Sbl 


dot Xri, XI 


{SsLi = XJi.Xl) 


9 




Sbr ^ 


syrk Xri 


{XsRi = XJi.Xr-) 


10 




rb 


dot Xri, y 


{fB,=Xl^y) 


11 




r 


posv S, r 


{n = s;'r,) 



2.2 Implementation Features 

Two implementation features allow OOC-HP-GWAS to attain near-perfect efficiency. First, by 
packing multiple vectors Xfj. into a matrix X^^, the slow BLAS-2 routine to solve a triangular 
linear system (trsv) at Line 7 can be transformed into a fast BLAS-3 trsm. Then, Listing [l.l| 
is an in-core algorithm that cannot deal with an X^ which does not fit into main memory. This 
limitation is overcome by turning the algorithm into an out-of-core one, in this case using a double- 
buffering technique: While the CPU is busy computing the block b of Xn in a primary buffer, the 
next block 6+1 can already be loaded into a secondary buffer through asynchronous I/O. The full 



OOC-HP-GWAS algorithm is shown in Listing 1.2 This algorithm attains more than 90% efficiency. 



Listing 1.2: The full OOC-HP-GWAS algorithm. 

L *- potrf M (LL^ = M) 

XI trsm L, XI {Xl=L'^Xl) 

y *- trsv L, y {y = L'^y) 

rt -i- gemv XI, y {vt = X]]y) 

Stl *- syrk XI {Stl = XIXl) 
aio_read Xr[l] 
for b in 1 . . blockcount : 

aio_reaQ Xr[b+1] 

aio_wait Xr [b] 

Xrb ir- trsm L, Xrb {X^ = L"^X[,) 
for Xri in Xr [b] : 

Sbl *- gemm Xri, XI {Sbl, = X^i^Xl) 

Sbr *- syrk Xri {Xbr^ = XJ^.Xr.) 

rb ^ gemv Xri, y (fs^ = ^J^.y) 



15 r <^ posv S, r (r^ = Si^fi) 

16 aio_wait r[b-l] 

17 aio_wr it - r [b] 

18 aio_wait r [blockcount ] 



3 Increasing Performance by Using GPUs 

While the efficiency of the OOC-HP-GWAS algorithm is satisfactory, the computations can be sped 
up even more by leveraging multiple GPUs. With the help of a profiler, we determined (confirming 



the intuition), that the trsm at line fO in Listing 1.2 is the bottleneck. Since cuBLAS provides a 
high-performance implementation of BLAS-3 routines, trsm it is the best candidate to be executed 
on GPUs. In this section, we introduce cuGWAS, an algorithm for a single GPU, and then extend 
it to an arbitrary number of GPUs. 

Before the trsm can be executed on a GPU, the algorithm has to transfer the necessary data. 
Since the size of L is around 800 MB, the matrix can be sent once during the preprocessing step and 
kept on the GPU throughout the entire computation. Unfortunately, the whole Xn matrix weights 
in at several TB, way more than the 2 GB per buffer limit of a modern GPU. The same holds true 
for the result X^j^ of the trsm, which needs to be sent back to main memory. Thus, there is no 
other choice than to send it in a block- by-block fashion, each block AT^^ weighting at most 2 GB. 

When profiled, a naive implementation of the algorithm displays a pattern (Fig. [S]) typical for 
applications in which GPU-ofHoading is an after-thought: both GPU (green) and CPU ( , < ) need 
to wait for the data transfer (orangi ); furthermore, the CPU is idle while the GPU is busy and 
vice- versa. 

Fig. 3: Profiled timings of the naive implementation. 



Our first objective is to make use of the CPU while the GPU computes the trsm. Regrettably, 
all operations following the trsm (i.e. the for-loop at Lines 11-15 in Listing 1.2 which we will call 
the S-loop) are dependent on its result and thus cannot be executed in parallel. A way to break 
out of this dependency is to delay the S-loop by one block, in a pipeline fashion, so that the S-loop 
relative to the 6-th block of Xn is delayed and executed on the CPU, while the GPU executes 
the trsm with the (6+l)-th block. Thanks to this pipeline, we have broken the dependency and 
introduced more parallelism, completely removing the gray part of Fig. |3] 



3.1 Streaming Data from HDD to GPU 

The second problem with the aforementioned naive implementation is the time wasted due to data 
transfers. Modern GPUs are capable of overlapping data transfers with computation. If properly 
exploited, this feature allows us to eliminate any overhead, and thus attain sustained peak perfor- 
mance on the GPU. 

The major obstacle is that the data is already being double-buffered from the hard-disk to the 
main memory. A quick analysis shows that when targeting two layers of double-buffering (one layer 



for disk main memory transfers and another layer for main memory GPU transfers), two 
buffers on each layer are not sufficient anymore. The idea here is to have two buffers on the GPU 
and three buffers on the CPU. 

The double-triple buffering can be illustrated from two perspectives: the tasks executed and 
the buffers involved. The former is presented in Fig. |4j we refer the reader to [7 for a thorough 
description. Here we only discuss the technique in terms of buffers. 
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Fig. 4: A task-perspective of the algorithm. Sizes are unrelated to runtime. 



In this single-GPU scenario, the size of the blocks Xr^ used in the GPU's computation is equal 
to that on the CPU. When using multiple GPUs, this will not be the case anymore, as the CPU 
loads one large block and distributes portions of it to the GPUs. 

The GPU's buffers are used in the same way as the CPU's buffers in the simple CPU-only 
algorithm: While one buffer a is used for the computation, the data is transferred to and from the 
other buffer /3. But on the CPU's level (i.e. in RAM), three buffers are now necessary. For the sake 
of simplicity, we avoid the explanation of the initial and final iterations and start with iteration b. 

With reference to Fig. 5a assume that the (5-l)-th, fe-th and (6+I)-th blocks already reside in 
the GPU buffers /3, a, and in the CPU buffer C, respectively. The block 6-1 (i.e. buffer /3) contains 
the solution of the trsm of block 6-1. At this point, the algorithm proceeds by dispatching both 
the read of the second-next block 6 + 2 from disk into buffer A and the computation of the trsm 
on the GPU on buffer a, and by receiving the result from buffer 13 into buffer B. The first two 
operations are dispatched, i.e. they are executed asynchronously by the memory system and the 
GPU, while the last one is executed synchronously because these results are needed immediately in 
the following step. 

As soon as the synchronous transfer f3 ^ B completes, the transfer of the next block 6+1 from 
CPU buffer C to GPU buffer /3 is dispatched, and the S-loop is executed on the CPU for the previous 
block 6-1 in buffer B on the CPU (see Fig. 5b ) . 
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Fig. 5: The multi-buffering algorithm as seen from a buffer perspective. 



As soon as the CPU is done computing the S-loop, its results are written to disk (Fig. 5c I. 
Finally, once all transfers are done, buffers are rotated (through pointer or index rotations, not 
copies) according to Fig. 5d and the loop continues vifith b ^ b + 1. 



3.2 Using Multiple CPUs 



This multi-buffering technique achieves sustained peak performance on one GPU. Since boards with 
many GPUs are becoming more and more common in high-performance computing, we explain 
here how our algorithm is adapted to take advantage of all the available parallelism. The idea is to 
increase the size of the X/j^ blocks by a factor as big as the number of available GPUs, and then split 
the trsm among these GPUs. As long as solving a trsra on the GPU takes longer than loading a 
large enough block Xji^ from HDD to CPU, this parallclization strategy holds up to any number of 
GPUs. Since in our systems loading the data from HDD was an order of magnitude faster than the 
computation of the trsm, the algorithm scales up to more GPUs than were available. Listing [l.3| 
shows the final version of cuGWASH 



^ The conditions for the first and last pair of iterations are provided in parentheses on the right. 



4 Results 



In order to show the speedups obtained with a single GPU, we compare the hybrid CPU-GPU 
algorithm presented in Listing [O] using one GPU with the CPU-only OOC-HP-GWAS. Then, to 
determine the scalability of cuGWAS, we compare its runtimes when leveraging 1, 2, 3 and 4 GPUs. 

In all of the timings, the time to initialize the GPU and the preprocessing (Lines 1-7 in List- 
ing 1.3), both in the order of seconds, have not been measured. The GPU usually takes 5 s to fully 
initialize, and the preprocessing takes a few seconds too, but depends only on n and p. This omission 
is irrelevant for computations that run for hours. 



Listing 1.3: cuGWAS. The black bullet is a placeholder for "aU GPUs" 



1 


L 


<- potrf M 




(ii^ = M) 


2 


cublas_send L L_gpu. 






3 


XI 


*- trsm L, XI 






4 


y 


trsv L, y 






5 


rt 


■i- gemv XI, y 




{fT=Xly)_ 


6 


Stl 


syrk XI 




{Stl = X^Xl) 


7 


gpubs blocksize/ngpus 






8 


for 


b in -1 . . blockcount+1 : 






9 




cu_trsm_wait Q. 




(if b in 1 . . blockcount ) 


10 




cu_send_wait C. /3. 




(if b in 2 .. blockcount+1 ) 


11 




a, *- cu_trsm_async L_gpu 


. , a. 


(if b in 1 .. blockcount (Xi, 


12 




aio_read Xr[b+2] A 




(if b in -1 . . blockcount-2 


13 




for gpu in . . ngpus : 




(if b in 2 .. blockcount + 1 ) 


14 




cu_recv B[gpu*gpubs 


. (gpu+1) *gpubs] ^ Pgpu 


15 




aio.wait Xr[b+1] -* C 




(if b m . . blockcount-1 ) 


16 




for gpu in 0.. ngpus: 




(if b in .. blockcount-1 ) 


17 




cu_send_async C[gpu* 


gpubs 


. (gpu+1) *gpubs] -> /3gpu 


18 




for Xri in B: 




(if b m 2 .. blockcount + 1 ) 


19 




Sbl ^ gemm Xri, XI 




(SsLi = XJi.Xl) 


20 




Sbr syrk Xri 




(XBRi = XJi.Xr.) 


21 




rb gemv Xri, y 




{fB,=Xl^y) 


22 




r ^ posv S, r 




in = S-^h) 


23 




aio_wait r [b-2 ] 




(if b in 1 .. blockcount + 1 ) 


24 




aio_writ - r [b-1 ] 




(if b in 1 .. blockcount + 1 ) 


25 




swap_buf f er s 







: L-^Xb) 



4.1 Single-GPU Results 

The experiments with a single-GPU were performed on the Quadro cluster at the RWTH Aachen 
University; the cluster is equipped with two nVidia Quadro 6000 GPUs and two Intel Xeon X5650 
CPUs per node. The GPUs, which are powered by Fermi chips, have 6 GB of RAM and a theoretical 
double-precision computational power of 515GFlops each. In total, the cluster has a GPU peak 
of 1.03TFlops. The CPUs, which have six cores each, amount to a total of 128 GFlops and are 
supported by 24 GB of RAM. The cost of the combined GPUs is estimated to about $10 000 while 
the combined CPUs cost around $2000. 



Figure [6^ shows the runtime of OOC-HP-GWAS along with that of cuGWAS, using one GPU. 
Thanks to our transfer-overlapping strategy, we can leverage the GPU's peak performance and 
achieve a 2.6x speedup over a highly-optimized CPU-only implementation. cuBLAS' trsm imple- 
mentation attains about 60% of the GPU's peak performance, i.e. about 309GFlops |5]. The peak 
performance of the CPU in this system amounts to 128 GFlops; if the whole computation were 
performed on the GPU at trsm's rate, the largest speedup possible would be 2.4. We achieve 2.6 
because the computation is pipelined: the S-loop is executed on the CPU, in perfect overlap with 
the GPU. This means that the performance of cuGWAS is perfectly in line with the theoretical 
peak. 

In addition, the figure indicates that the algorithm (1) has linear runtime in m and (2) allows 
us to cope with an arbitrary m. The red vertical line in the figure marks the largest value of m 
for which two blocks of fit into the GPU memory for n = 10 000. Without the presented multi- 
buffering technique, it would not be possible to compute GWAS with more than m = 22 500 SNPs, 
while cuGWAS allows the solution of GWAS with any given amount of SNPs. 



4.2 Scalability v^^ith Multiple CPUs 

To experiment with multiple CPUs, we used the Tesla cluster at the Universitat Jaume I in Spain, 
since it is equipped with an nVidia Tesla S2050 which contains four Fermi chips (same model as 
the Quadro system), for a combined GPU compute power of 2.06TFlops, but with only 3 GB of 
RAM each. The host CPU is an Intel Xeon E5440 delivering approximately 90 GFlops. 

In order to evaluate the scalability of cuGWAS, we solved a GWAS with p = 4, n = 10000, 
and m = 100 000 on the Tesla cluster, varying the number of GPUs. As it can be seen in Fig. |6jD, 
the scalability of the algorithm with respect to the number of GPUs is almost ideal: Doubling the 
amount of GPUs reduces the runtime by a factor of 1.9. 



a) Runtime with respect to SNP count m b) Scalability with respect to GPU count 




OK 22,5K 45K 67,5K 90K 12 3 4 

m (SNP count) Number of GPUs 

+ OOC-HP-GWAS O cuGWAS 1 GPU O cuGWAS Ideal scalability 



Fig. 6: The runtime of our cuGWAS algorithm a) using IGPU compared to OOC-HP-GWAS, using 
IGPU and b) using a varying amount of GPUs. 



5 Conclusion and Future Work 



We have presented a strategy which makes it possible to sustain peak performance on a GPU not 
only when the data is too big for the CPU's memory, but also for main memory. In addition, we 
have shown how well this strategy lends itself to exploit an arbitrary number of CPUs. 

As described by the developers of ProbABEL, the solution of a problem of the size described 
in Section [T^ by the GWFGLS algorithm took 4 hours. In contrast, with cuGWAS we solved the 
same problem in 2.88 s. Even accounting for about 6 seconds for the initialization and Moore's Law 
(doubling the runtime as ProbABEL's timings are from 2010), the difference is dramatic. We believe 
that the contribution of cuGWAS is an important step towards making GWAS practical. 



Software The code implementing the strategy explained in this paper is freely available at http : 
[//git hub . com/ lucasb-eyer/ cuGWAS, and .http : / / lucas-b . eyer .be[ 
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